feat(endpoints): Add OpenAI Responses API endpoint with fixes and integration tests #43
acere wants to merge 32 commits into awslabs:main from
Conversation
… test suite

- Add ResponseEndpoint and ResponseStreamEndpoint classes for OpenAI Responses API support
- Implement non-streaming and streaming response handling with proper error management
- Add structured output support with response format validation and serialization
- Create a comprehensive unit test suite covering response parsing, error handling, format validation, model parameters, payload parsing, properties, and serialization
- Add integration tests for Bedrock response endpoint functionality
- Export the new response endpoint classes from the endpoints module
- Update the integration test configuration with response endpoint fixtures
- Rename max_tokens to max_output_tokens in create_payload (Response API parameter name)
- Fix _parse_response to handle usage=None (Bedrock Mantle) and use input_tokens/output_tokens with a fallback to prompt_tokens/completion_tokens
- Rewrite _parse_stream_response to process typed events (response.output_text.delta, response.completed) instead of the old chunk-with-output-array format
- Fix test_response_bedrock.py to use ResponseUsage attribute names (input_tokens/output_tokens)
- Add integration tests for ResponseEndpoint and ResponseStreamEndpoint
- Add an example notebook for the Response API on Bedrock
- Update all unit test mocks to match the new behavior
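The typed-event stream parsing described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual LLMeter implementation: plain dicts stand in for the SDK's typed event objects, and the exact field names inside each event are assumptions.

```python
# Sketch of parsing Response API stream events: accumulate text from
# response.output_text.delta events, read usage from response.completed.
def parse_stream(events):
    text_parts = []
    usage = None
    for event in events:
        etype = event.get("type")
        if etype == "response.output_text.delta":
            text_parts.append(event.get("delta", ""))
        elif etype == "response.completed":
            # Final event carries the completed response, including usage
            usage = event.get("response", {}).get("usage")
    return "".join(text_parts), usage

# Hypothetical event sequence shaped like the typed events named above
events = [
    {"type": "response.output_text.delta", "delta": "Hel"},
    {"type": "response.output_text.delta", "delta": "lo"},
    {"type": "response.completed",
     "response": {"usage": {"input_tokens": 3, "output_tokens": 2}}},
]
text, usage = parse_stream(events)
```

Note how usage arrives only with the terminal event, which is why a parser that only walks an output array on each chunk (the old format) misses it.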
llmeter/endpoints/__init__.py
Outdated
    ResponseEndpoint,
    ResponseStreamEndpoint,
Shouldn't these be OpenAIResponseEndpoint and OpenAIResponseStreamEndpoint for consistency with the existing ChatCompletion ones? "ResponseEndpoint" seems very generic.
llmeter/endpoints/openai_response.py
Outdated
    try:
        client_response = self._client.responses.create(**payload)
    except APIConnectionError as e:
        logger.error(e)
In bedrock_invoke and litellm we're using logger.exception(e), which also prints the stack trace... I'd suggest we standardize on one or the other when handling endpoint invocation errors into InvocationResponse.error_outputs?
go logger.exception(e)!
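For reference, the difference the reviewers are standardizing on: inside an except block, logger.exception() logs the message plus the active traceback, while logger.error() logs the message only (unless exc_info is passed). A small self-contained demonstration capturing the log output:

```python
import io
import logging

# Capture log output so the two calls can be compared directly
stream = io.StringIO()
logger = logging.getLogger("llmeter.example")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)

try:
    raise ConnectionError("endpoint unreachable")
except ConnectionError as e:
    logger.error(e)       # message only
    logger.exception(e)   # message plus the stack trace

output = stream.getvalue()
```

After this runs, `output` contains the message twice but only one "Traceback" section, contributed by the logger.exception() call.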
llmeter/endpoints/openai_response.py
Outdated
        max_output_tokens: int = 256,
        instructions: str | None = None,
        **kwargs,
    ) -> Dict:
Unlike boto3, the OpenAI Python SDK has pretty solid (and TypedDict-based) typings already... Should we even be creating this convenience method in LLMeter? Or just typing payload as ResponseCreateParams for this endpoint and encouraging users to build it via the OpenAI SDK directly?
(Same logic would apply to the existing ChatCompletions endpoint too)
Updated all OpenAI endpoint classes to leverage the SDK typing; that simplified some of the parsing gymnastics. I'm not in favor of sunsetting create_payload: it's not a hard requirement to create payloads with this method, but it offers an easy, consistent way to create tests across providers.
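The TypedDict pattern being discussed can be illustrated with a self-contained sketch. The ResponseParams class below is a hypothetical, simplified stand-in for the OpenAI SDK's real ResponseCreateParams; the point is the pattern of typing the payload while still offering a create_payload convenience helper, as in the compromise above.

```python
from typing import TypedDict

# Hypothetical, simplified stand-in for the SDK's ResponseCreateParams;
# total=False makes every key optional, as in the real TypedDict params.
class ResponseParams(TypedDict, total=False):
    model: str
    input: str
    max_output_tokens: int
    instructions: str

def create_payload(
    model: str, user_input: str, max_output_tokens: int = 256
) -> ResponseParams:
    """Convenience helper in the spirit of the create_payload kept above."""
    return ResponseParams(
        model=model, input=user_input, max_output_tokens=max_output_tokens
    )

payload = create_payload("my-model", "Hello")
```

Because TypedDicts are plain dicts at runtime, users who prefer the SDK directly can hand-build the same payload and type-checkers will still validate the keys.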
tests/integ/test_response_bedrock.py
Outdated
    # Configure OpenAI client with Bedrock Mantle endpoint for Response API
    # Response API uses bedrock-mantle endpoint, not bedrock-runtime
    base_url = f"https://bedrock-mantle.{aws_region}.api.aws/v1"
    client = OpenAI(api_key=token, base_url=base_url)
Looks like this is just testing the OpenAI SDK and not the LLMeter endpoint??
Same for the streaming test below too
Also, I almost forgot: we should add the relevant module placeholder .md under the docs API reference.
- Replace Poetry with uv in the GitHub Actions PyPI workflow for faster builds
- Update .gitignore to track uv.lock instead of poetry.lock
- Migrate pyproject.toml from Poetry format to standard PEP 621 format
- Update CONTRIBUTING.md with uv installation and development instructions
- Update README.md with uv installation examples for both basic and extras installs
- Simplify dependency management and build configuration
- Improve CI/CD performance and developer experience with uv tooling
…poetry in the documentation. Update the test documentation.
- Upgrade the astral-sh/setup-uv action from v4 to v7
- Update the Python version requirement from <3.13 to <4 in pyproject.toml
- Add a reference to tests/README.md in CONTRIBUTING.md for testing documentation
- Align with the uv package manager migration and improve version flexibility
Use importlib.metadata to read the version from installed package metadata, with a fallback to "0.0.0" when the package is not formally installed. This fixes `AttributeError: module 'llmeter' has no attribute '__version__'`.
Use the __name__ variable to retrieve LLMeter's version from importlib, rather than hard-coding the module's name.
Update test payloads and JMESPath expressions in test_bedrock_invoke.py to match Amazon Nova's native Invoke API format, since the default BEDROCK_TEST_MODEL was changed from Claude to Nova in PR awslabs#36.

- Non-streaming: use output.message.content[0].text and usage.outputTokens
- Streaming: use contentBlockDelta.delta.text and metadata.usage.*Tokens
- Request payload: use schemaVersion messages-v1 and inferenceConfig

Fixes awslabs#38
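The Nova-format expectations above can be illustrated with a hypothetical response body; plain dict access below mirrors the JMESPath paths used in the tests (the field values are made up for illustration):

```python
# Hypothetical Amazon Nova Invoke API response, shaped per the paths above
response = {
    "output": {"message": {"content": [{"text": "Hello!"}]}},
    "usage": {"inputTokens": 12, "outputTokens": 5},
}

# Equivalent of the JMESPath expression "output.message.content[0].text"
text = response["output"]["message"]["content"][0]["text"]

# Equivalent of "usage.outputTokens"
out_tokens = response["usage"]["outputTokens"]
```

The old Claude-style paths (e.g. a top-level completion string) no longer match this nested message/content structure, which is why the fixtures had to change.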
onnxruntime 1.24.3 dropped Python 3.10 support, causing the release workflow to fail. Bump the build environment to Python 3.12.
uv build only needs the build backend (hatchling), which it resolves on its own. Installing all dev/test dependencies is unnecessary and was pulling in onnxruntime, which lacks Python 3.10 wheels.
Still lots of gaps to fill in
…nd fix build warnings

- Add a metrics and statistics page with LLM latency concepts (TTFT, TTLT, TPOT), percentile reliability guidance, run-level stats, cost metrics, and visualization examples
- Add API reference pages for callbacks (base, cost, mlflow) and the bedrock_invoke endpoint
- Update the installation page with uv instructions, the mlflow extra, and the OpenAI-compatible API description
- Fix broken relative links in index.md and key_concepts.md
- Add type annotations to fix all griffe warnings in the mkdocs build
- Fix docstring issues (a parameter name mismatch, indentation) in base.py and runner.py
- Pin mkdocs<2 to avoid incompatible upstream changes
- Add a callbacks card to the homepage
Move overall homepage within the User Guide instead of a confusing separate tab. Add an API Reference home page.
We don't have GitHub Discussions enabled anyway.
Add headers to module pages so they don't appear as 'index'. Add some clarifying text to API reference home page. Add some missing pages and fix associated griffe type warnings. Improve some docstrings.
As discussed at https://fpgmaas.com/blog/collapse-of-mkdocs/, MkDocs has been unmaintained for some time and the new v2 will not support Material for MkDocs, which we used for theming. Migrate to Zensical, a project by the Material for MkDocs team that aims to offer easy compatibility. Also, update the docs GitHub workflow to reflect our moves from Poetry to uv and from MkDocs to Zensical.
Include section in contributing file to guide devs on how to preview and maintain the documentation website.
Remove custom analytics placeholder page. Fill out 'run experiments' placeholder page. Move unnecessarily folder-nested user guide pages up to the root (URL won't change if we folder them again in future when we have more content).
Add push trigger for main branch with path filters on docs/** and mkdocs.yml so documentation updates are deployed without waiting for a release.
…dencies

The docs build only needs mkdocstrings and zensical. Using --only-group instead of --group skips the main project dependencies (torch, mlflow, nvidia packages, etc.) that are not needed for static doc generation.
The deploy-pages action requires id-token: write to obtain the ACTIONS_ID_TOKEN_REQUEST_URL needed for authentication.
- Include .github/workflows/docs.yml in the path filter so workflow changes also trigger a docs build
- Add the id-token: write permission required by the deploy-pages action
- Use --only-group docs to skip unnecessary main dependencies
- Add the environment declaration required by the deploy-pages action
- Use --frozen on uv sync and --no-sync on uv run to prevent re-installing the full project dependencies during the build
- Configure the mkdocstrings Python handler with Google-style docstring parsing, source links, cross-references, and merged __init__ docs
- Add the missing prompt_utils API reference page and nav entry
- Fix table column width issues causing awkward word splits in code tokens by keeping inline code on one line and setting a min-width on description columns
- Update CONTRIBUTING.md with lightweight docs build instructions using uv sync --only-group docs
Going back to sorting attributes alphabetically in the API doc for easier searching.
uv version without --no-sync modifies pyproject.toml and triggers an automatic sync, resolving and installing all 280+ dependencies unnecessarily in the publish workflow.
- Rename ResponseEndpoint -> OpenAIResponseEndpoint and ResponseStreamEndpoint -> OpenAIResponseStreamEndpoint for consistency with the OpenAICompletionEndpoint naming convention
- Change logger.error() to logger.exception() for stack trace consistency with bedrock_invoke.py and litellm.py
- Rewrite test_response_bedrock.py to test the LLMeter endpoint wrappers instead of the raw OpenAI SDK
- Update serialization test assertions for the new class names
- Update example notebook references
- Add a docs/reference/endpoints/openai_response.md placeholder
- Add openai_response to the mkdocs.yml nav under endpoints
- Update the connect_endpoints user guide to mention the Response API endpoints
- Type the invoke() payload as CompletionCreateParams / ResponseCreateParams
- Type the create_payload() return as SDK TypedDicts using cast()
- Replace jmespath with a plain list comprehension in _parse_payload
- Rewrite the stream parsers using typed ChatCompletionChunk / event types, removing all hasattr/getattr fallbacks and type: ignore comments
- Make OpenAIResponseStreamEndpoint inherit from OpenAIResponseEndpoint, deduplicating _parse_payload and create_payload
- Use collections.abc.Sequence instead of typing.Sequence
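The jmespath-to-list-comprehension swap mentioned in this commit can be sketched as follows. The payload shape and the extraction target (user message texts) are hypothetical stand-ins for whatever _parse_payload actually pulls out, but the pattern is the same: a comprehension over a known dict structure replaces a query-string dependency.

```python
# Hypothetical chat-style payload
payload = {
    "messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
        {"role": "user", "content": "Bye"},
    ]
}

# Instead of e.g. jmespath.search("messages[?role=='user'].content", payload),
# a plain comprehension with no extra dependency:
user_texts = [
    m["content"]
    for m in payload.get("messages", [])
    if m.get("role") == "user"
]
```

For a fixed, well-typed payload structure the comprehension is easier to type-check than a string query, which fits the TypedDict direction of the rest of the commit.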
Summary
Adds the OpenAI Responses API endpoint support for LLMeter, with fixes to align with the actual API behavior.
Changes
Endpoint fixes (llmeter/endpoints/openai_response.py)

- Rename max_tokens to max_output_tokens in create_payload (Response API parameter name)
- Fix _parse_response to handle usage=None (Bedrock Mantle doesn't always return it) and use input_tokens/output_tokens with a fallback to prompt_tokens/completion_tokens
- Rewrite _parse_stream_response to process typed events (response.output_text.delta, response.completed) instead of the old chunk-with-output-array format

Integration tests

- tests/integ/test_response_endpoint.py: integration tests for the ResponseEndpoint and ResponseStreamEndpoint wrappers against Bedrock Mantle
- Fix tests/integ/test_response_bedrock.py to use ResponseUsage attribute names (input_tokens/output_tokens)

Unit test updates

- Spec-based usage mocks (input_tokens/output_tokens) and event-based streaming mocks

Example notebook

- examples/LLMeter with OpenAI Response API on Bedrock.ipynb demonstrating non-streaming and streaming usage with Runner and plotting

Testing